Imputation-free reconstructions of three-dimensional chromosome architectures in human diploid single-cells using allele-specified contacts

Single-cell Hi-C analysis of diploid human cells is difficult because of the lack of dense chromosome contact information and the presence of homologous chromosomes with very similar nucleotide sequences. Thus here, we propose a new algorithm to reconstruct the three-dimensional (3D) chromosomal architectures from the Hi-C dataset of single diploid human cells using allele-specific single-nucleotide variations (SNVs). We modified our recurrence plot-based algorithm, which is suitable for the estimation of the 3D chromosome structure from sparse Hi-C datasets, by newly incorporating a function of discriminating SNVs specific to each homologous chromosome. Here, we eventually regard a contact map as a recurrence plot. Importantly, the proposed method does not require any imputation for ambiguous segment information, but could efficiently reconstruct 3D chromosomal structures in single human diploid cells at a 1-Mb resolution. Datasets of segments without allele-specific SNVs, which were considered to be of little value, can also be used to validate the estimated chromosome structure. Introducing an additional mathematical measure called a refinement further improved the resolution to 40-kb or 100-kb. The reconstruction data supported the notion that human chromosomes form chromosomal territories and take fractal structures where the dimension for the underlying chromosome structure is a non-integer value.


Recurrence plots
Suppose that a series { ( )| = 1,2, … , } is given with a defined threshold . In a Hi-C experiment, this threshold corresponds to the distance where allele pairs are detected as neighbors or a contact. Then a recurrence plot (Eckmann et al., 1987;Marwan et al., 2007) is defined as Reconstruction of the original three-dimensional chromosome structure from single diploid cell Hi-C data A lot of information seems to be lost when converting a series into a recurrence plot. However, we can recover the rough shape of the original series from a recurrence plot (Thiel et al., 2004;Hirata et al., 2008). In particular, reconstruction using the method by (Hirata et al., 2008) is supported by proofs (Hirata et al., 2015;Khor and Small, 2016). This study also uses the method of (Hirata et al., 2008) to infer the three-dimensional structure for chromosomes of a single diploid cell from its corresponding Hi-C dataset as follows. Suppose that for a cell line, we array the first maternal alleles of chromosomes from 1 to X followed by the second paternal alleles of chromosomes from 1 to X in females and from 1 to Y in males. We use this line as the axis for the series. Let a contact map for a single diploid cell be given as ( , ). Namely, if there is a contact between allele segments and at a 40-kb resolution and both contain single nucleotide variations, then set ( , ) = 1. Otherwise, set ( , ) = 0.
The following steps are used to reconstruct the three-dimensional structure of diploid chromosomes at a 1-Mb resolution and subsequently refine the three-dimensional structure at a 40-kb or 100-kb resolution.
Step (1): Declare that consecutive (C =) 25 or 10 segments from a contact pair on the same allele of the same chromosome are spatial neighbors when the chromosome structure is reconstructed at a 40-kb or 100-kb resolution, respectively: If ( , ) = 1, allele segment and allele segment + are on the same allele of the same chromosome. Additionally, allele 2 segment and allele segment + are also on the same allele of the same chromosome.
Namely, variable a counts the number of neighbors not shared between points 2 +Ci and 2 +Cj in the 1-Mb resolution, while variable b counts the number of neighbors contained in either point of 2 +Ci or 2 +Cj in the 1-Mb resolution. Then construct a graph with edges where the local distances are attached in the 1-Mb resolution. If all the nodes are connected in this graph, move to the next step. Otherwise, the three-dimensional structure of the chromosomes cannot be reconstructed.
Step (3): Find the shortest distance between every pair of nodes on this graph and estimate the global geodesic distance ( 2 + , 2 + ) in the 1-Mb resolution for every pair ( , ) of nodes. In this step, various approaches are possible. One approach is the Dijkstra method (Dijkstra, 1959). Another is the Johnson method (Johnson, 1977).
Step (4): Find a set of point configurations such that the above global geodesic distances are preserved by multidimensional scaling (Gower, 1966). If the top three eigenvalue components are selected, our reconstruction is obtained for the three-dimensional structure for a single diploid cell at the 1-Mb resolution. This reconstruction is described by the three-dimensional series as { ( 2 + )}.
Then find the four nearest neighbors. Let ( ) be the set of indices for the four nearest neighbors of . Then a weight ( , 2 + ) for each ∈ ( ) is given as Lastly, a reconstruction is obtained for segment at a 40-kb or 100-kb resolution as ( ) = ∑ ( , 2 + ) ( 2 + ) ∈ ( ) .

Derivation for the detection limit length of Hi-C data in the proposed method
In the proposed method, the detection limit length corresponds to threshold , which is a constant provided from the experimental data. If the distance between two points is at the detection limit length, the distance coincides with . Equation (1) can be changed in the following way: ( 2 + , 2 + ) = 1 − ∑ ( 2 + , 2 + ) ( 2 + , 2 + ) ( 2 + , 2 + ) .
Here in the second term of the right-hand side, the numerator can be interpreted as the number of points where points i and j simultaneously have the contacts with points 2 + , while the denominator can be regarded as the number of points where either i or j has a contact with points 2 + ,. Thus, if we approximate the ratio of points defined by Eq. (1) with the ratio of the corresponding volumes in three-dimensional space (Supplementary Figure 8), the local distance ̅ of Eq.
(1) at the detection limit can be written using two overlapping spheres as We have for the length of the detection limit of Hi-C data in the calculation. A similar story holds for Eq. (2) as well.

Time complexity of the proposed algorithm
Let be the number of points for a coarse reconstruction, and be the number of points for a finer reconstruction. Let be the mean number of contacts for each segment in the coarse reconstruction. Here, assume that ~.
Step (2) requires ( 3 ) calculations to construct a graph for a coarse resolution.
Step (5) refines the resolution in ( 2 ). Overall, the algorithm runs in ( 3 + 2 ), which is ( 2 )  if the distance between two Cα atoms of these residues is below the threshold. We varied the threshold from 6 Å to 15 Å in 1 Å increments to generate an original contact map for each case. After the reconstruction, we matched the scale for each reconstruction so that the mean distance between points i and i+1 for each i was 3.8 Å ( Supplementary Fig. 2). In these examples, our reconstructions tend to be more accurate from the viewpoint of the 3D correlation coefficient (Hirata et al., 2016) than that of (Lesne et al., 2014) when the local distance is set to 1 without using the Jaccard coefficient. See Eq. (1)

in the Supplementary
Material for more details.

Evaluating the ratio of intra-chromosomal contacts
We evaluated the distances between all pairs of representative points at a 100-kb resolution.
If the distance is less than 22/27, which is the Hi-C detection limit, then the corresponding pair is recorded as neighboring. Otherwise, this pair is not neighboring. For each allele on each chromosome, we found the ratio of intra-contacts among all the contacts for each cell.

Analysis for the space of nucleolus
We examined whether there is a vacant space in the center of our reconstruction using the following method. First, find the box containing all the points of our reconstruction for a cell using the above approach. Second, divide the box into 20x20x20 segments. Finally, check that there is at least one empty box within the 6x6x6 boxes at the center.
7. Evaluating the ability of our reconstructions to separate neighboring points from other points far apart We investigated this by making 2x2 tables and comparing whether the distance is within the Hi-C detection limit or whether the corresponding pair is a phased contact. Below, the results are summarized for GM cells only because for PBMC cells, about 80% of the phased contacts are those of the maternal Chr X.
Supplementary Table 3 shows an example for GM cell 2 with the proposed method with a 1- Table 4 presents the corresponding overall results for GM cells. These tables clearly demonstrate that the proposed method successfully classifies the phased contacts as spatial neighbors, while classifying most other points as those far away without such phased contacts. The odds ratios for Supplementary Tables 3 and 4 are 2437.9 and 1003.2, respectively. Thus, the specificity for the proposed method is high.

Mb band width (25 points). Supplementary
Supplementary Table 5 shows an example for GM cell 2 by Tan et al. (2018). Supplementary Table 6 presents the corresponding overall results. Although the method of Tan et al. (2018) can successfully classify contact points, its specificity is inferior to the proposed method because the odds ratios for Supplementary Tables 5 and 6 are 142.7 and 157.0, respectively.

Effects of the band width in our reconstruction
We evaluated the effects of the band width using an approach similar to Supplementary Text